Survey of Research on Chunking Techniques
Abstract
The explosive growth of data produced by different devices and applications has contributed to the abundance of big data. To process such volumes of data efficiently, strategies such as de-duplication have been employed. Of the three levels of de-duplication (file level, block level and chunk level), de-duplication at the chunk level, also known as byte level, is the most popular and widely deployed. Many chunking techniques are available, and they are categorised as Whole File Chunking, Fixed Size Chunking (FSC) and Content Defined Chunking (CDC). The objective of this paper is to analyse the performance of existing chunking techniques based on their characteristics. The study describes the significance of each technique to help researchers understand the options and select a technique for their own research.
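The three categories named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the parameters (`size`, `window`, `mask`, `min_size`, `max_size`) and the use of a recomputed SHA-1 window hash in place of a true rolling hash are all assumptions chosen to keep the example short.

```python
import hashlib


def whole_file_chunk(data: bytes) -> list[bytes]:
    """Whole File Chunking: the entire file is a single chunk."""
    return [data] if data else []


def fixed_size_chunks(data: bytes, size: int = 8) -> list[bytes]:
    """Fixed Size Chunking (FSC): cut every `size` bytes, ignoring content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 4, mask: int = 0x07,
                           min_size: int = 4, max_size: int = 32) -> list[bytes]:
    """Toy Content Defined Chunking (CDC).

    A boundary is declared where the hash of the last `window` bytes has its
    low bits (selected by `mask`) equal to zero. Assumption: the window hash
    is recomputed at every position for clarity; real systems use a rolling
    hash so that each step costs O(1).
    """
    assert min_size >= window, "window must fit inside the smallest chunk"
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        win = data[i - window + 1:i + 1]
        h = int.from_bytes(hashlib.sha1(win).digest()[:4], "big")
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def unique_bytes(chunks: list[bytes]) -> int:
    """Bytes that remain after de-duplicating chunks by SHA-256 fingerprint."""
    seen = {}
    for c in chunks:
        seen.setdefault(hashlib.sha256(c).hexdigest(), len(c))
    return sum(seen.values())
```

With FSC, inserting a single byte near the start of a file shifts every later boundary, so previously stored chunks no longer match. The CDC sketch places boundaries by content, so boundaries after the edit tend to realign. That difference is the usual motivation for CDC, though the actual savings depend on the data and the parameters.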
Similar resources
Determining the Boundaries and Types of Syntactic Phrases in Persian Texts
Text tokenization is the process of splitting text into meaningful tokens such as words, phrases, sentences, etc. Tokenization of syntactic phrases, known as chunking, is an important preprocessing step needed in many applications such as machine translation, information retrieval, text to speech, etc. In this paper chunking of Farsi texts is done using statistical and learning methods and the grammat...
Dynamic Chunking for Out-of-Core Volume Visualization Applications
Given the size of today’s data, out-of-core visualization techniques are increasingly important in many domains of scientific research. In earlier work a technique called dynamic chunking [1] was proposed that can provide significant performance improvements for an out-of-core, arbitrary direction slicer application. In this work we validate dynamic chunking for several common data access patte...
Learning High Utility Rules by Incorporating Search Control
Many learning systems must confront the problem of run time after learning being greater than run time before learning. This utility problem has been a particular focus of research in explanation-based learning. This research focuses on the expensive chunk problem, in which individual learned rules are so expensive to match that the system suffers a slowdown from learning. In past work, there ha...
On the Automatic Generation of Cases Libraries by Chunking Chess Games
As a research topic computer game playing has contributed problems to AI that manifest exponential growth in the problem space. For the most part, in games such as chess and checkers these problems have been surmounted with enormous computing power on brute-force search methods using massive databases. It remains to be seen whether such techniques will extend to other games such as go and shogi...
Minimally Supervised Japanese Named Entity Recognition: Resources and Evaluation
Approaches to named entity recognition that rely on hand-crafted rules and/or supervised learning techniques have limitations in terms of their portability to new domains as well as their robustness over time. To overcome those limitations, this paper evaluates named entity chunking and classification techniques in Japanese named entity recognition in the context of minimall...